ArmSpeech: Armenian Spoken Language Corpus

Abstract

The Armenian language is an independent branch of the Indo-European language family and the official language of the Republic of Armenia and the Republic of Artsakh. According to various reliable sources, an average of 3 million people in Armenia and 10-12 million people in the Armenian Diaspora use Armenian as their native language. The largest Armenian communities outside Armenia are in the United States of America, Canada, the Russian Federation, the Islamic Republic of Iran, the French Republic, the Syrian Arab Republic, and the Lebanese Republic. This paper presents the ArmSpeech Armenian speech corpus, a collection of annotated speech intended for research and development of natural language processing (NLP) technologies. It is designed for speech-to-text and text-to-speech purposes but can also be used in other domains (e.g. spoken language identification). The corpus contains 6206 high-quality audio samples: 11 hours 46 minutes 26 seconds (11.77 hours) of speech from multiple speakers of any age, gender and accent. As a result, this is the most extensive Armenian speech corpus in the public domain for speech recognition, speech synthesis and spoken language identification systems.
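The duration figures quoted in the abstract can be cross-checked with a few lines of arithmetic. The sketch below uses only the numbers stated there (11 h 46 min 26 s, 6206 samples); the variable and function names are illustrative, and the mean clip length is a derived value that the abstract itself does not report.

```python
# Figures quoted in the abstract.
HOURS, MINUTES, SECONDS = 11, 46, 26
NUM_SAMPLES = 6206


def total_seconds(hours: int, minutes: int, seconds: int) -> int:
    """Convert an h/m/s duration to whole seconds."""
    return hours * 3600 + minutes * 60 + seconds


total = total_seconds(HOURS, MINUTES, SECONDS)

# Decimal hours, for comparison with the "11.77 hours" in the abstract.
decimal_hours = total / 3600

# Derived quantity (not stated in the abstract): mean length of one sample.
mean_clip_seconds = total / NUM_SAMPLES

print(f"total: {total} s = {decimal_hours:.2f} h")
print(f"mean clip length: {mean_clip_seconds:.2f} s")
```

The decimal-hour value agrees with the 11.77 hours given in the abstract, and the derived mean suggests short, sentence-length recordings.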

Similar resources

Corpus of Spoken Slovak Language

In this paper, a short description of activities towards building a general speech corpus of spoken Slovak is given. The different rôles and specific features of text corpora and speech corpora are investigated, and the most frequent mistakes and misunderstandings about the concept of a speech corpus are mentioned. The concept of a big representative corpus of spoken language and its desir...

Spoken language corpus for machine interpretation research

This paper describes a database of speech and language that we are currently constructing for research on machine interpretation. The database contains bilingual data of lectures and dialogues. We have collected about 72 hours of speech in total and transcribed it manually. We have investigated the database in order to acquire empirical knowledg...

Spoken language identification using the speechdat corpus

Current language identification systems vary significantly in their complexity. The systems that use higher level linguistic information have the best performance. Nevertheless, that information is hard to collect for each new language. The system presented in this paper is easily extendable to new languages because it uses very little linguistic information. In fact, the presented system needs...

The ATIS Spoken Language Systems Pilot Corpus

Speech research has made tremendous progress in the past using the following paradigm: define the research problem, collect a corpus to objectively measure progress, and solve the research problem. Natural language research, on the other hand, has typically progressed without the benefit of any corpus of data with which to test research hypotheses. We describe the Air Travel Information System (A...

A corpus-centered approach to spoken language translation

This paper reports the latest performance of components and features of a project named Corpus-Centered Computation (C3), which targets a translation technology suitable for spoken language translation. C3 places corpora at the center of the technology. Translation knowledge is extracted from corpora by both EBMT and SMT methods, translation quality is gauged by referring to corpora, the best t...

Journal

Journal title: International Journal of Scientific Advances

Year: 2022

ISSN: 2708-7972

DOI: https://doi.org/10.51542/ijscia.v3i3.25